
Multinode evals #1000

Open

Oseltamivir wants to merge 29 commits into main from multinode_eval

Conversation

@Oseltamivir
Collaborator

Summary

Add eval-only support for multi-node benchmarks and wire those eval results into CI collection + summary reporting.

This covers:

  • eval matrix selection for multi-node configs
  • eval-only workflow jobs for multi-node sweeps
  • AMD MI355X eval execution in server.sh
  • NVIDIA Slurm eval execution through Oseltamivir's srt-slurm fork
  • eval artifact upload, score validation, and multi-node-aware summary tables

How evals are run

Single-node evals are selected on 8k1k at the maximum and median concurrency for each (model, runner, framework, precision, spec-decoding, dp-attn) group.

Multi-node evals are selected on 8k1k by taking the entry with the highest max concurrency for each (model, runner, framework, precision, spec-decoding, prefill-dp-attn, decode-dp-attn) group, then running eval at the
median concurrency from that config via eval-conc.
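The multi-node selection rule above can be sketched as follows. This is a minimal illustration, not the actual config-generation code: the entry fields and the `select_multinode_evals` helper are hypothetical stand-ins.

```python
from statistics import median

def select_multinode_evals(entries):
    """Sketch: per group, pick the 8k1k entry with the highest max
    concurrency, then eval at that entry's median concurrency."""
    groups = {}
    for e in entries:
        # Hypothetical group key mirroring the fields named in the PR text.
        key = (e["model"], e["runner"], e["framework"], e["precision"],
               e["spec_decoding"], e["prefill_dp_attn"], e["decode_dp_attn"])
        groups.setdefault(key, []).append(e)

    selected = []
    for group in groups.values():
        # Entry with the highest max concurrency wins the group.
        best = max(group, key=lambda e: max(e["concurrencies"]))
        # Eval runs at the median concurrency from that winning config.
        selected.append({**best, "eval_conc": int(median(best["concurrencies"]))})
    return selected
```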

EVAL_ONLY=true starts the server with expanded eval context, skips throughput benchmarking, runs lm-eval,
writes meta_env.json + results*.json + sample*.jsonl, uploads those artifacts, then validates scores
against thresholds.
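The final score-validation step can be sketched like this. `validate_scores` and the threshold format are assumptions for illustration, not the actual validation code:

```python
def validate_scores(results, thresholds):
    """Sketch of the post-eval score gate: compare each task's metric
    against its configured minimum and collect any failures."""
    failures = []
    for task, min_score in thresholds.items():
        score = results.get(task)
        # A missing task counts as a failure, same as a below-threshold score.
        if score is None or score < min_score:
            failures.append((task, score, min_score))
    return failures
```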

srt-slurm fork delta vs upstream

NVIDIA multinode eval uses Oseltamivir/srt-slurm@sa-submission-q1-2026 instead of ishandhanani/srt-slurm.

Compared with current upstream/main, that fork adds the eval path InferenceX needs:

  • a new lm-eval benchmark runner
  • /infmax-workspace mounting via INFMAX_WORKSPACE
  • EVAL_ONLY support in do_sweep.py to skip benchmark stage and run post-eval directly
  • full wait_for_model() health checking before eval in eval-only mode
  • pass-through of framework/model/topology/env metadata into the eval container
  • MODEL_NAME=self.config.served_model_name so eval queries the served alias, not the HF repo id
  • pass-through of the workflow's EVAL_CONC into EVAL_CONCURRENT_REQUESTS
  • copying eval outputs into /logs/eval_results/ for launcher-side artifact pickup
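The eval-only health gate named above can be sketched roughly as follows. `wait_for_model()` is mentioned in the fork delta, but its real signature and the `is_ready` callable here are assumptions:

```python
import time

def wait_for_model(is_ready, timeout_s=1800, poll_s=10):
    """Sketch of the eval-only health gate: block until the server
    reports ready or the timeout expires. `is_ready` is a hypothetical
    callable wrapping the real health endpoint."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(poll_s)
    # Server never came up; the eval stage should abort rather than run.
    return False
```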

Validation

| Run | Scope | Result |
| --- | --- | --- |
| 23888824506 | B300 eval-only | 8/8 jobs passed, GSM8K 94.69-96.13 |
| 23802423939 | GB300 eval-only | 12/12 jobs passed, GSM8K 89.92-95.98 |
| 23882945894 | B200 multinode eval-only | 17/18 jobs passed; 1 dynamo-sglang FP8 MTP job hit server health timeout |
| 23909140268 | H200 multinode eval-only | 4/8 jobs passed; dynamo-trt passed, dynamo-sglang jobs failed before Slurm log creation, issue raised |
| 23800447228 | MI355X multinode eval-only | no-DPA rows passed GSM8K 95.68-96.66; DPA=true rows failed at 0.0, issue raised |

Oseltamivir and others added 29 commits February 19, 2026 16:12
The sglang 0.5.8 Docker image ships a newer lm-eval 0.4.9.2 commit
that defaults fewshot_as_multiturn=True for chat-completion models.
Since the version string matches the pinned commit, pip silently
skips the install. Adding --force-reinstall ensures the pinned
commit is always used regardless of what's pre-installed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds dsr1-fp8-mi355x-sglang-disagg-nodpa-eval: same image/model/precision
as the DPA config but with dp-attn=false and ep=1. Running evals on this
will tell us if DPA is the cause of the 0% GSM8K score or if it's
something else about the fp8 disagg setup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Oseltamivir Oseltamivir requested a review from a team April 3, 2026 04:01
@github-actions
Contributor

github-actions bot commented Apr 3, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.


@functionstackx
Contributor

@Oseltamivir can you upstream your changes to srt-slurm?

Contributor

@claude claude bot left a comment


This is a large infrastructure PR (22 files) touching multi-node CI/CD workflows, all NVIDIA runner scripts, AMD server logic, and Python result collection — including a switch from the ishandhanani/srt-slurm fork to Oseltamivir/srt-slurm across all NVIDIA runners, which warrants a human look.


Overview

The PR adds eval-only support for multi-node benchmarks, touching GitHub Actions workflows (benchmark-multinode-tmpl.yml, e2e-tests.yml, run-sweep.yml), all six NVIDIA Slurm runner scripts, the AMD MI355X server.sh/job.slurm/submit.sh, shared benchmark_lib.sh, and Python utilities for config generation and result collection.

Security risks

The most notable concern is the switch from ishandhanani/srt-slurm to Oseltamivir/srt-slurm@sa-submission-q1-2026 across all NVIDIA multi-node runners. This changes the external code being cloned and executed on production cluster runners. While the PR description enumerates the fork's delta vs upstream, a human should validate the trust decision of pinning to this fork at this branch.

Level of scrutiny

High scrutiny is warranted. This PR touches production CI/CD infrastructure across multiple hardware platforms, introduces a new external dependency fork, and the PR description itself documents known partial failures (H200 dynamo-sglang jobs failing, MI355X DPA=true rows failing at 0.0). These open issues suggest the eval path is not fully stable across all targets yet.

Other factors

The no-bugs finding from the automated system is reassuring for logic correctness, but the scope (22 files, new workflow job types, eval artifact pipeline, split summary tables) and the documented known failures make this a PR that should have at least one human reviewer before merge.

